## Installation

**Set up environment.**

```bash
cd LangRepo
conda create -n langrepo python=3.10 -y
conda activate langrepo
pip install openai pandas transformers accelerate sentence-transformers
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

**Download pre-extracted captions.**

Download the pre-extracted captions provided in the LLoVi GitHub repo from [this drive](https://drive.google.com/file/d/13M10CB5ePPVlycn754_ff3CwnpPtDfJA/view?usp=drive_link). Refer to the LLoVi repo for details on how the captions were extracted. Unzip the captions into ```./data```.
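As a minimal sketch of the unzip step: the archive name `captions.zip` below is a placeholder (an assumption, not the actual filename), so substitute the name of the file you downloaded. Depending on how the archive is laid out, you may need to move files afterwards so that paths such as `data/nextqa/llava1.5_fps1.json`, used by the commands in this README, resolve.

```bash
# CAPTIONS_ZIP is a hypothetical name; point it at the archive you actually downloaded.
CAPTIONS_ZIP="${CAPTIONS_ZIP:-captions.zip}"

mkdir -p ./data
if [ -f "$CAPTIONS_ZIP" ]; then
  # -q: quiet, -o: overwrite existing files without prompting
  unzip -q -o "$CAPTIONS_ZIP" -d ./data
else
  echo "archive not found: $CAPTIONS_ZIP" >&2
fi

# List what ended up in ./data so the layout can be checked by eye.
ls ./data
```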

**Download models.**

Download the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model checkpoints, following the instructions on Hugging Face. Place the checkpoints in ```./hf_ckpt```.
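One possible way to fetch the checkpoints is the `huggingface-cli download` command from the `huggingface_hub` package. This is a sketch, not the only supported route. The helper `download_ckpt` and the `DRY_RUN` switch are illustrative names introduced here. By default the helper only prints the command it would run; set `DRY_RUN=0` to download for real. Some checkpoints may require accepting the model terms on the model page and authenticating with `huggingface-cli login` first.

```bash
# download_ckpt and DRY_RUN are hypothetical helpers, not part of this repo.
download_ckpt() {
  repo="$1"
  # Strip the "mistralai/" org prefix so the folder matches the --model paths in this README.
  target="./hf_ckpt/${repo#*/}"
  if [ "${DRY_RUN:-1}" != "0" ]; then
    # Dry run (default): show the command without executing it.
    echo "huggingface-cli download $repo --local-dir $target"
  else
    mkdir -p "$target"
    huggingface-cli download "$repo" --local-dir "$target"
  fi
}

download_ckpt mistralai/Mistral-7B-Instruct-v0.2
download_ckpt mistralai/Mixtral-8x7B-Instruct-v0.1
```

Run with `DRY_RUN=0` exported in the shell to perform the actual downloads.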

## EgoSchema

**Create repository.**

```bash
python main_repo.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--text_encode clip \
--dataset egoschema \
--output_base_path output/egoschema/rep \
--output_filename m7b_rephrase_egoschema.json \
--num_examples_to_run -1 \
--task sum \
--prompt_type rephrase_sum_mistral \
--num_iterations 1 \
--num_chunks [4] \
--merge_ratio 0.25 \
--dst_stride 4 \
--num_words_in_rephrase 20 \
--num_words_in_sum 500 \
--read_scales [-4,-3,-2,-1]
```
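Judging from the `--data_path` values passed to `main_ll_eval.py` in this README, the repository step appears to write a `_data.json` file next to the given output filename. That naming is inferred from the commands here rather than confirmed from the code. A quick check, under that assumption, before moving on to evaluation:

```bash
# Path inferred from the --data_path used by the EgoSchema evaluation command in this README.
REPO_DATA="output/egoschema/rep/m7b_rephrase_egoschema_data.json"

if [ -f "$REPO_DATA" ]; then
  echo "found: $REPO_DATA"
else
  echo "missing: $REPO_DATA (run main_repo.py first)"
fi
```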

**Answer multiple-choice questions.**

```bash
python main_ll_eval.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--dataset egoschema \
--output_base_path output/egoschema \
--output_filename m7b_lleval_egoschema.json \
--data_path output/egoschema/rep/m7b_rephrase_egoschema_data.json \
--num_examples_to_run -1 \
--prompt_type qa_ll_mistral
```

## NExT-QA

**Create repository.**

```bash
python main_repo.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--text_encode clip \
--dataset nextqa \
--output_base_path output/nextqa/rep \
--output_filename m7b_rephrase_nextqa.json \
--data_path data/nextqa/llava1.5_fps1.json \
--anno_path data/nextqa/val.csv \
--duration_path data/nextqa/durations.json \
--num_examples_to_run -1 \
--task sum \
--prompt_type rephrase_sum_mistral \
--num_iterations 2 \
--num_chunks [2,2] \
--merge_ratio 0.25 \
--dst_stride 2 \
--num_words_in_rephrase 20 \
--num_words_in_sum 500 \
--read_scales [-2,-1]
```

**Answer multiple-choice questions.**

```bash
python main_ll_eval.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--dataset nextqa \
--output_base_path output/nextqa \
--output_filename m7b_lleval_nextqa.json \
--data_path output/nextqa/rep/m7b_rephrase_nextqa_data.json \
--anno_path data/nextqa/val.csv \
--duration_path data/nextqa/durations.json \
--num_examples_to_run -1 \
--prompt_type qa_ll_mistral_nextqa
```

## IntentQA

**Create repository.**

```bash
python main_repo.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--text_encode clip \
--dataset intentqa \
--output_base_path output/intentqa/rep \
--output_filename m7b_rephrase_intentqa.json \
--data_path data/nextqa/llava1.5_fps1.json \
--anno_path data/intentqa/test.csv \
--duration_path data/nextqa/durations.json \
--num_examples_to_run -1 \
--task sum \
--prompt_type rephrase_sum_mistral \
--num_iterations 1 \
--num_chunks [1] \
--merge_ratio 0.25 \
--dst_stride 4 \
--num_words_in_rephrase 20 \
--num_words_in_sum 500 \
--read_scales [-1]
```

**Answer multiple-choice questions.**

```bash
python main_ll_eval.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--dataset intentqa \
--output_base_path output/intentqa \
--output_filename m7b_lleval_intentqa.json \
--data_path output/intentqa/rep/m7b_rephrase_intentqa_data.json \
--anno_path data/intentqa/test.csv \
--duration_path data/nextqa/durations.json \
--num_examples_to_run -1 \
--prompt_type qa_ll_mistral_nextqa
```

## NExT-GQA

**Create repository.**

```bash
python main_repo.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--text_encode clip \
--dataset nextgqa \
--output_base_path output/nextgqa/rep \
--output_filename m7b_rephrase_nextgqa.json \
--data_path data/nextqa/llava1.5_fps1.json \
--anno_path data/nextgqa/test.csv \
--duration_path data/nextqa/durations.json \
--num_examples_to_run -1 \
--task sum \
--prompt_type rephrase_sum_mistral \
--num_iterations 2 \
--num_chunks [2,1] \
--merge_ratio 0.25 \
--dst_stride 2 \
--num_words_in_rephrase 20 \
--num_words_in_sum 250 \
--read_scales [-3,-2,-1]
```

**Answer multiple-choice questions.**

```bash
python main_ll_eval.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--dataset nextgqa \
--output_base_path output/nextgqa \
--output_filename m7b_lleval_nextgqa.json \
--data_path output/nextgqa/rep/m7b_rephrase_nextgqa_data.json \
--anno_path data/nextgqa/test.csv \
--duration_path data/nextqa/durations.json \
--num_examples_to_run -1 \
--prompt_type qa_ll_mistral_nextqa
```

**Ground answers in time.**

```bash
python main.py \
--model ./hf_ckpt/Mistral-7B-Instruct-v0.2/ \
--dataset nextgqa \
--output_base_path output/nextgqa \
--output_filename m7b_grounding_nextgqa.json \
--data_path data/nextqa/llava1.5_fps1.json \
--anno_path data/nextgqa/test.csv \
--duration_path data/nextqa/durations.json \
--nextgqa_gt_ground_path data/nextgqa/gsub_test.json \
--nextgqa_pred_qa_path output/nextgqa/m7b_lleval_nextgqa.json \
--num_examples_to_run -1 \
--task gqa \
--prompt_type gqa_mistral
```
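The grounding command above reads several files, including the predictions written by the preceding question-answering step (`--nextgqa_pred_qa_path`). As a small sketch, the helper below reports which of those inputs are missing before a long run is started. The function name `check_inputs` is illustrative and not part of this repo. The paths are copied from the command above.

```bash
# check_inputs is a hypothetical helper: prints each missing file, then a summary count.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  echo "$missing file(s) missing"
}

check_inputs \
  data/nextqa/llava1.5_fps1.json \
  data/nextgqa/test.csv \
  data/nextqa/durations.json \
  data/nextgqa/gsub_test.json \
  output/nextgqa/m7b_lleval_nextgqa.json
```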
